Abstract

Turkic languages exhibit extensive and diverse etymological relationships among lexical items. These relationships make the Turkic 
languages promising for exploring automated translation lexicon induction by leveraging cognate and other etymological information. 
However, due to the extent and diversity of the types of relationships between words, it is not clear how to annotate such information. In 
this paper, we present a methodology for annotating cognates and etymological origin in Turkic languages. Our method strives to balance 
the amount of research effort the annotator expends with the utility of the annotations for supporting research on improving automated 
translation lexicon induction. 


1. Introduction 

Automated translation lexicon induction has been investigated
in the literature and shown to be feasible for various language
families and subgroups, such as the Romance languages and the
Slavic languages (Mann and Yarowsky, 2001; Schafer and
Yarowsky, 2002). Although some studies have used Swadesh lists
of words to identify Turkic language groups and loanword
candidates (van der Ark et al., 2007), we are not aware of
any work yet on automated translation lexicon induction for
the Turkic languages.

However, the Turkic languages are well suited to exploring
such technology since they exhibit many diverse lexical
relationships both within the family and to languages outside of
the family through loanwords. For the Turkic languages, it
is prudent to leverage both cognate information and other
etymological information when automating translation lexicon
induction. However, we are not aware of any corpora
for the Turkic languages that have been annotated for this
information in a way suitable for supporting automatic
translation lexicon induction. Moreover, performing the annotation
is not straightforward because of the range of relationships
that exist. In this paper, we lay out a methodology
for performing this annotation that is intended to balance
the amount of effort expended by the annotators with the
utility of the annotations for supporting computational
linguistics research.


2. Main Annotation System 

We obtained the dictionary of the Turkic languages
(Öztopçu et al., 1996). One section of this dictionary
contains 1996 English glosses and, for each English gloss, a
corresponding translation in the following eight Turkic
languages: Azerbaijani, Kazakh, Kyrgyz, Tatar, Turkish,
Turkmen, Uyghur, and Uzbek. Table 1 shows an example for
the English gloss ‘alive.’ When a language has an official
Latin script, that script is used. Otherwise, the dictionary’s
transliteration is shown in parentheses. Our annotation system
assigns each Turkic word a two-character code. The first
character is a number indicating which words are cognate
with each other, and the second character indicates
etymological information. Subsection 2.1 discusses how to
define and annotate cognates, and subsection 2.2 discusses
how to define and annotate etymological information.
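As an illustration only, the two-character code described above could be represented and validated as follows. This is a minimal sketch, not part of the annotation tooling: the names `Annotation`, `ETYMOLOGY_CODES`, and `parse_annotation` are hypothetical, and the set of letters is taken from the codes defined in subsections 2.2 and 2.2.1.

```python
from typing import NamedTuple

# Etymology letters from subsections 2.2 and 2.2.1 of this paper.
ETYMOLOGY_CODES = frozenset("TAPRFEIGCQXVN")


class Annotation(NamedTuple):
    """Hypothetical container for one two-character annotation."""
    cognate_class: int  # 1-8: words sharing a number are cognate
    etymology: str      # one letter from ETYMOLOGY_CODES


def parse_annotation(code: str) -> Annotation:
    """Split a two-character code such as '1R' into its two parts."""
    if len(code) != 2:
        raise ValueError(f"expected two characters, got {code!r}")
    digit, letter = code[0], code[1]
    if digit not in "12345678":
        raise ValueError(f"cognate class must be 1-8, got {digit!r}")
    if letter not in ETYMOLOGY_CODES:
        raise ValueError(f"unknown etymology letter {letter!r}")
    return Annotation(int(digit), letter)
```

With this sketch, `parse_annotation("1R")` yields cognate class 1 and etymology letter R, while a malformed code raises `ValueError` so that it can be caught during data entry.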


2.1. Cognates 

According to the Oxford English Dictionary Online,¹ accessed
on February 2, 2012, ‘cognate’ is defined as: “...Of
words: Coming naturally from the same root, or representing
the same original word, with differences due to subsequent
separate phonetic development; thus, English five,
Latin quinque, Greek πέντε, are cognate words, representing
a primitive *penke.” As this definition shows, shared
genetic origin is key to the notion of cognateness. A word
is only considered cognate with another if both words proceed
from the same ancestor. Nonetheless, in line with the
conventions of previous research in computational linguistics,
we set a broader definition. We use the word ‘cognate’
to denote, as in (Kondrak, 2001): “...words in different
languages that are similar in form and meaning, without
making a distinction between borrowed and genetically related
words; for example, English ‘sprint’ and the Japanese
borrowing ‘supurinto’ are considered cognate, even though
these two languages are unrelated.” These broader criteria
are motivated by the ways scientists develop and use cognate
identification algorithms in natural language processing
(NLP) systems. For cross-lingual applications, the advantage
of such technology is the ability to identify words
for which similarity in meaning can be accurately inferred
from similarity in form; it does not matter if the similarity
in form is from strict genetic relationship or later borrowing.

However, not every pair of apparently similar words will
be annotated as cognate. For them to be considered cognates,
the differences in form between them must meet a
threshold of consistency within the data. We will explain
the definitions and rules for the annotators to follow in
order to establish such a threshold.

First, we elaborate on how our notion of cognate differs
from that of strict genetic relation. At a high level, there
are two cases to consider: A) where the words involved are
native Turkic words, and B) where the words involved are
shared loanwords from non-Turkic languages. Within case
A, there are two cases to consider: (A1) genetic cognates;
and (A2) intra-family loans. Table 2 shows an example of
case A1. This example shows the English gloss ‘one’ for
all eight Turkic languages, descended from the same postulated
form, *bir, in Proto-Turkic (Róna-Tas, 2006). Case
A1 is the strict definition of ‘cognate,’ and these are to be
annotated as cognate.

1 http://www.oed.com/view/Entry/35870?redirectedFrom=cognate


Case A2 is for intra-family loans, i.e., a word of ultimately
Turkic origin borrowed by one Turkic language from another
Turkic language. These cases, contrary to the strict
definition, are to be marked as cognate in our system. An
example is the modern Turkish neologism almaş ‘alternation,
permutation’, incorporated from the Kyrgyz (almaş)
‘change’ (Türk Dil Kurumu, 1942). While rare, it is used
today in Turkish scholarly literature to describe concepts in
areas such as mathematics and botany. Processing genetic
cognates (case A1) and intra-family loans (case A2) differently
would have little impact on the success of a cross-dictionary
lookup system. In fact, accounting for the difference
might limit the efficacy of such a system. Also, the
time depth of intra-Turkic borrowings may be centuries or
mere decades. The more distant the borrowing, the more
difficult it will be for annotators to distinguish between
cases A1 and A2. Hence, instances of case A2 are to be
annotated as cognate in our system.²


Case B is for situations of shared loanwords, where the
source of the words is ultimately non-Turkic. There are
three subcases: (B1) loanwords borrowed from the same
non-Turkic language; (B2) loanwords borrowed from different
non-Turkic languages, but of the same ultimate origin;
and (B3) loanwords of non-Turkic origin borrowed via
another Turkic language.


Table 3 shows an example of case B1, the word ‘book,’ borrowed
from Arabic in all eight Turkic languages. Table 4
shows an example of case B2, the word ‘ballet,’ borrowed
from Russian in all cases except Turkish, where it was borrowed
directly from the French. Table 5 shows an example
of case B3: the word ‘benefit’ in Kyrgyz was borrowed
most likely through Uzbek or Chaghatay (Kirchner, 2006),
but the Uzbek word was borrowed from Persian, and ultimately
from Arabic. It is difficult and time-consuming
for annotators to make these fine-grained distinctions. And
again, for computational processing, such distinctions are
not expected to be helpful. Hence, all of cases B1, B2, and
B3 are to be annotated as cognate in our system.


Recall that all our annotations are two-character codes; the
first character is a number from one to eight indicating which
words are cognate with each other. Table 6 shows the first
character of the annotations for the example from Table 1.
The words marked with 1 are cognate with each other and
the words marked 2 are cognate with each other.
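To make the role of the first character concrete, the following sketch groups the eight languages of one dictionary entry into cognate sets. The class numbers are those shown in Table 6 for the gloss ‘alive’; the names `ALIVE_COGNATE_CLASS` and `cognate_sets` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from typing import Dict, List

# First annotation character for the 'alive' entry, as given in Table 6.
ALIVE_COGNATE_CLASS = {
    "Azerbaijani": 1, "Kazakh": 2, "Kyrgyz": 2, "Tatar": 1,
    "Turkish": 1, "Turkmen": 2, "Uyghur": 2, "Uzbek": 2,
}


def cognate_sets(classes: Dict[str, int]) -> Dict[int, List[str]]:
    """Group languages whose words carry the same cognate number."""
    groups = defaultdict(list)
    for language, cognate_class in classes.items():
        groups[cognate_class].append(language)
    return dict(groups)


sets = cognate_sets(ALIVE_COGNATE_CLASS)
```

Here `sets[1]` holds Azerbaijani, Tatar, and Turkish, and `sets[2]` holds the remaining five languages. A lexicon induction experiment could, for instance, treat every pair of languages inside one set as a positive cognate pair, though that use is an assumption of this sketch rather than something prescribed by the annotation scheme.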


2 For similar reasons, false cognates may be annotated as cognate
if the annotator does not have readily available knowledge
indicating that they are false cognates. Although this is a potential
limitation of our system, it is not clear how to distinguish false
cognates from true cognates without significant additional
annotation expense.


2.2. Etymology 

The second character in a word’s annotation indicates a
conjecture about etymological origin, e.g., T for Turkic.
The decision to annotate word origin is motivated by its
value for facilitating the development of technology for
cross-language lookup of unknown forms. We therefore
take a practical approach, balancing the value of the annotation
for this purpose with the amount of effort required
to perform the annotation. We have created the following
code for annotating etymology:

T Turkic origin. This includes compound forms and affixed
forms whose constituents are all Turkic. For
example, the Turkmen for ‘manager’, yolbaşçy, is
marked T because its compound base, yol with baş,
and affix -çy are all Turkic in origin.

A Arabic origin, to include words borrowed indirectly
through another language such as Persian. For example,
the word for ‘book’ is marked A for all eight
Turkic languages. Because
variations on the Arabic form /kita:b/ exist in every
Turkic language, in Persian, and in other languages of
the Islamic world, it is difficult to tease out the word’s
trajectory into a language such as Kyrgyz. The burden
of researching these fine distinctions is not placed on
the annotator, as explained below.

P Persian origin, not including Arabic words possibly borrowed
through Persian. An example is the word for
‘color’ in many Turkic languages, from the Persian
/rang/.

R borrowed from Russian, including words that are ultimately
of French origin.

F French origin, not including ultimately French words
borrowed from Russian. Direct French loans occur almost
exclusively in Turkish. An example is the word
for ‘station’ in Turkish, istasyon.

E English origin. For example, the word for ‘basketball’ in
every language.

I Italian origin. Usually of importance only to specific
domains in Turkish.

G Greek origin. For example, the word in Azerbaijani,
Turkish, Turkmen, Uyghur, and Uzbek for ‘box’
comes from the Greek κουτί.

C Chinese origin, usually Mandarin and usually of importance
only to Uyghur. An example is the word for
‘mushroom’ in Uyghur, (mogu).

Q unknown or inconclusive origin. 

The careful reader will have noticed that there is an inconsistency
in that words of ultimately Arabic origin borrowed
through Persian are marked as A, but words of ultimately
French origin borrowed through Russian are marked as R.
There are two reasons for this. The first is annotator efficiency.
Making the judgment that a word is ultimately
of Arabic origin is much easier than having to figure out
whether it was borrowed from Arabic or indirectly from
Persian. For the Russian/French situation, the distinction is
much easier to make. To begin with, the Russian loanwords
occur almost exclusively in former USSR languages and
the French loanwords occur almost exclusively in Turkish.
Also, the orthography often gives clear cues for making this
distinction, as Russian loanwords consistently retain
characteristically Russian letters.
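The asymmetry just described can be summarized in a small lookup. The sketch below is an illustration under stated assumptions, not the annotators' procedure: the function `etymology_code` and its `via_russian` flag are hypothetical, and the rule "a Russian intermediary yields R, otherwise use the ultimate origin" is our reading of the code descriptions above.

```python
from typing import Optional

# Single-origin codes from subsection 2.2.
ORIGIN_CODES = {
    "Turkic": "T", "Arabic": "A", "Persian": "P", "Russian": "R",
    "French": "F", "English": "E", "Italian": "I", "Greek": "G",
    "Chinese": "C",
}


def etymology_code(ultimate_origin: Optional[str],
                   via_russian: bool = False) -> str:
    """Return the second annotation character for a simple word.

    Assumption of this sketch: a word that entered through Russian is
    coded R regardless of its ultimate source, while every other word
    is coded by ultimate origin (so Arabic via Persian stays A).
    Unknown or unlisted origins fall back to Q.
    """
    if via_russian:
        return "R"
    return ORIGIN_CODES.get(ultimate_origin, "Q")
```

Under these assumptions, ‘ballet’ borrowed through Russian would be coded with `etymology_code("French", via_russian=True)`, giving R, while the Turkish form borrowed directly from French would be coded F. The multi-language exception codes are outside the scope of this sketch.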


2.2.1. Multi-Language Exceptions 

We also define other codes that categorize certain complex
words that do not fall into any of the categories described
in subsection 2.2. Other etymological annotation studies,
such as the Loanword Typology project and its World Loanword
Database (Haspelmath and Tadmor, 2009), have instructed
linguists to pass over such complex words and optionally
flag them as “contains a borrowed base,” etc. Our
annotation system requires that these words, which are very
common in Turkic languages, be annotated according to
more fine-grained categories.

The following are our multi-language exception codes: 


X Compound words where the constituents are from different
origins. For example, the Tatar word for ‘truck’,
(yök mashinası), is to be marked X since it contains
Russian-origin (mashina), ‘machine, vehicle’ in compound
with Tatar (yök), ‘baggage, cargo.’ In contrast,
the Turkish compound word for ‘thunder’, gök
gürlemesi, will be marked T because all of its constituents
are Turkish.

V A verb formed by combining a non-Turkic base with a
Turkic auxiliary verb or denominal affix. For example,
the verb ‘to repeat’ in Azerbaijani, Tatar, and Turkish
is marked V because it consists of a noun borrowed from
the Arabic /takra:r/ plus a Turkic auxiliary verb et- or it-.

N A nominal consisting of a non-Turkic base bearing one
or more Turkic affixes, in cases where removing the
affixes results in a form that can plausibly be found
elsewhere in the data or in a loan language dictionary.
For example, the Kazakh word for ‘baker,’ (nawbayshı),
is composed of a Persian-origin base, from
/na:nva:/, ‘baker’, and a suffix that indicates a person
associated with a profession, (-shı). The Turkmen
word for ‘baker,’ (çörekçi), on the other hand, will be
marked T, because both its base (çörek) and affix (-çi)
are Turkic.
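The three exception codes, together with the T code for all-Turkic complex words, can be read as a small decision procedure. The sketch below is one possible rendering under stated assumptions: the function `classify_complex_word`, its `kind` labels, and the per-constituent origin lists are hypothetical inputs that an annotator would have to supply, and the plausibility check required for N is a human judgment that the code does not model.

```python
from typing import Sequence


def classify_complex_word(kind: str, origins: Sequence[str]) -> str:
    """Suggest T, X, V, or N for a compound or affixed word.

    kind    -- "compound", "verb", or "nominal" (hypothetical labels)
    origins -- origin of each constituent (bases, affixes, auxiliary)
    """
    distinct = set(origins)
    if distinct == {"Turkic"}:
        return "T"  # all constituents Turkic
    mixed_with_turkic = "Turkic" in distinct and len(distinct) > 1
    if kind == "compound" and len(distinct) > 1:
        return "X"  # constituents from different origins
    if kind == "verb" and mixed_with_turkic:
        return "V"  # non-Turkic base + Turkic auxiliary or affix
    if kind == "nominal" and mixed_with_turkic:
        return "N"  # non-Turkic base + Turkic affix(es)
    raise ValueError("case not covered by the rules sketched here")
```

For the examples above, a compound with Russian and Turkic constituents (Tatar ‘truck’) comes out as X, an Arabic noun with a Turkic auxiliary (‘to repeat’) as V, a Persian base with a Turkic suffix (Kazakh ‘baker’) as N, and the all-Turkic Turkmen ‘baker’ as T.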

Table 7 shows an example of an entry that has been fully
annotated for both cognates and etymology.


3. Inter-Annotator Agreement 

We pilot-tested our annotation system with two annotators
on 400 etymology annotations.³ Both annotators have studied
linguistics. Also, both are native English speakers with
experience studying or speaking multiple Turkic languages,
Persian, and Arabic. Training consisted of studying the
authors’ annotation manual and asking any follow-up questions.
Both annotators made approximately 240 annotations
per hour.


3 Table 8 has 392 entries because the annotators claimed eight
entries had multiple translations for the same English gloss.


Table 8 shows the contingency matrix for annotating the
400 entries.⁴ From Table 8 it is immediate that agreement
is substantial, and when there is disagreement it is largely
for the difficult cases of inconclusive origin and the
multi-language exceptions: Q, X, V, and N. We measured
inter-annotator agreement using Cohen’s Kappa (Cohen, 1960)
and found Kappa = 0.5927 (95% CI = 0.5192 to 0.6662).
If we restrict attention to only the instances where neither
of the annotators marked an inconclusive origin or
multi-language exception, then Kappa is 0.9216, generally
considered high agreement. This shows that our annotation
system is feasible for use and also shows that to improve the
system we might focus efforts on finding ways to increase
agreement on the annotation of the exceptional cases (Q, X,
V, and N).
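For readers who want to reproduce this kind of analysis, the sketch below computes Cohen’s Kappa from a square contingency matrix and shows how rows and columns for selected labels can be dropped before recomputing it. It is a minimal illustration: the function names are hypothetical, and the small matrix is a toy example invented for the sketch, not the counts from Table 8.

```python
from typing import List, Sequence, Set

Matrix = List[List[int]]


def cohens_kappa(matrix: Matrix) -> float:
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for a square count matrix."""
    n = sum(sum(row) for row in matrix)
    if n == 0:
        raise ValueError("matrix has no observations")
    k = len(matrix)
    observed = sum(matrix[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def restrict(matrix: Matrix, labels: Sequence[str],
             excluded: Set[str]) -> Matrix:
    """Keep only items where neither annotator chose an excluded label."""
    keep = [i for i, label in enumerate(labels) if label not in excluded]
    return [[matrix[i][j] for j in keep] for i in keep]


# Toy data (NOT from Table 8): rows are one annotator, columns the other.
toy_labels = ["T", "A", "Q"]
toy_counts = [[20, 5, 3],
              [10, 15, 2],
              [1, 4, 6]]
toy_restricted = restrict(toy_counts, toy_labels, {"Q"})
toy_kappa = cohens_kappa(toy_restricted)
```

On the toy data, dropping the Q row and column leaves a 2×2 matrix with observed agreement 0.7 and chance agreement 0.5, so `toy_kappa` is 0.4 up to floating-point rounding. Confidence intervals such as the one reported above are not computed by this sketch.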
Table 8: Table of Counts for two annotators’ etymological
conjectures on 392 words. Annotator 1’s conjectures follow
the horizontal axis, and annotator 2’s the vertical.


4. Conclusions and Future Work 

The Turkic languages are a promising candidate family of
languages to benefit from automated translation lexicon
induction. A necessary step in that direction is the creation
of annotated data for cognates and etymology. However,
this annotation is not straightforward, as the Turkic languages
exhibit extensive and diverse etymological relationships
among words. Some distinctions are difficult for annotators
to make and some are easier. Also, some distinctions
are expected to be more useful than others for automating
cross-lingual applications among the Turkic languages.
We presented an annotation methodology that balances
the research effort required of the annotator with the
expected value of the annotations. We surveyed and explained
the wide range of the most important relationships
observed in the Turkic languages and how to annotate them.
When we finish the annotations, we would like to make the
annotated data available as long as it is legal under copyright
laws for us to do so. Finally, we hope that our annotation
system and the associated discussion can be useful for
other teams that are annotating Turkic resources, and perhaps
parts of it can be useful for annotating resources for
other language families as well.


4 We left out columns for English, Greek, Italian, and Chinese,
which were not relevant for the 50 entries (according to unanimous
agreement of our annotators).
Azerbaijani  Kazakh  Kyrgyz   Tatar    Turkish  Turkmen  Uyghur   Uzbek
canlı        (tiri)  (tirüü)  (janlı)  canlı    diri     (tirik)  tirik

Table 1: Example entry from the eight-way dictionary for the English gloss ‘alive.’


Azerbaijani  Kazakh  Kyrgyz  Tatar  Turkish  Turkmen  Uyghur  Uzbek
bir          (bir)   (bir)   (ber)  bir      bir      (bir)   bir

Table 2: Example of case A1: genetic cognates. The English gloss is ‘one.’


Azerbaijani  Kazakh   Kyrgyz   Tatar    Turkish  Turkmen  Uyghur   Uzbek
kitab        (kitap)  (kitep)  (kitap)  kitap    kitap    (kitab)  kitob

Table 3: Example of case B1: loanwords borrowed from the same non-Turkic language. The English gloss is ‘book.’


Azerbaijani  Kazakh   Kyrgyz   Tatar    Turkish  Turkmen  Uyghur   Uzbek
balet        (balet)  (balet)  (balet)  bale     balet    (balet)  balet

Table 4: Example of case B2: loanwords borrowed from different non-Turkic languages, but of the same ultimate origin.
The English gloss is ‘ballet.’


Azerbaijani  Kazakh   Kyrgyz   Tatar    Turkish  Turkmen  Uyghur   Uzbek
fayda        (payda)  (payda)  (fayda)  fayda    peyda    (payda)  foyda

Table 5: Example of case B3: loanwords of non-Turkic origin borrowed via another Turkic language. The English gloss is
‘benefit.’


Azerbaijani  Kazakh  Kyrgyz   Tatar    Turkish  Turkmen  Uyghur   Uzbek
canlı        (tiri)  (tirüü)  (janlı)  canlı    diri     (tirik)  tirik
1            2       2        1        1        2        2        2

Table 6: Example with cognates annotated.


Azerbaijani  Kazakh     Kyrgyz     Tatar      Turkish   Turkmen  Uyghur     Uzbek
stul         (orındıq)  (orunduk)  (urındık)  sandalye  stul     (orunduq)  kursi
1R           2T         2T         2T         3A        1R       2T         4A

Table 7: Example with complete annotation both for cognates and etymology. The English gloss here is ‘chair.’










